Distortion-Resistant Hashing for rapid search of similar DNA subsequence
Abstract
One of the basic tasks in bioinformatics is localizing a short subsequence S, read during sequencing, in a long reference sequence R, such as the human genome. A natural fast approach is to compute a hash value for S and compare it with a prepared database of hash values for every length-|S| subsequence of R. The problem with this approach is that it only detects a perfect match, while in reality there are many small changes: substitutions, deletions and insertions. This issue could be addressed with a hash function designed to tolerate small distortions according to an alignment metric (such as Needleman-Wunsch), so that two similar sequences most likely receive the same hash value. This paper discusses the construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on rate-distortion theory: the hash value of S identifies the sequence closest to S within a nearly uniform subset of length-|S| sequences. This gives some control over the distance between colliding sequences, that is, sequences having the same hash value.
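The idea of hashing a sequence to its nearest member of a fixed subset can be illustrated with a small brute-force sketch. This is not the construction from the paper, whose details are not given in the abstract. It assumes a randomly drawn codebook as a stand-in for the "nearly uniform subset", and it substitutes unit-cost edit distance for a Needleman-Wunsch alignment score. The names `make_codebook`, `drh_hash` and `build_index` are hypothetical.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def make_codebook(length: int, size: int, seed: int = 0) -> list[str]:
    """Assumption: a random codebook stands in for the nearly uniform subset."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(size)]


def drh_hash(seq: str, codebook: list[str]) -> int:
    """Hash value = index of the codeword closest to seq (brute force)."""
    return min(range(len(codebook)),
               key=lambda i: edit_distance(seq, codebook[i]))


def build_index(reference: str, k: int, codebook: list[str]) -> dict[int, list[int]]:
    """Map each hash value to the start positions of length-k windows of R."""
    index: dict[int, list[int]] = {}
    for pos in range(len(reference) - k + 1):
        h = drh_hash(reference[pos:pos + k], codebook)
        index.setdefault(h, []).append(pos)
    return index
```

To look up a read S, one would compute `drh_hash(S, codebook)` and then verify the candidate positions stored under that value with a full alignment. A read that differs from a window of R by a small change often, but not always, falls into the same bucket. The linear scan over the codebook is only for illustration and would not scale to a genome-sized reference.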
Journal: CoRR
Volume: abs/1602.05889
Publication date: 2016